Textpresso - an Information Retrieval and Extraction System for Biological Literature

Authors

  • Hans-Michael Müller
  • Arun Rangarajan
  • Tracy K. Teal
  • Kimberly Van Auken
  • Juancarlos Chan
  • Paul W. Sternberg
Abstract

We developed an information retrieval and extraction system that processes the full text of biological papers. The system, called Textpresso, separates text into sentences, labels words and phrases according to an ontology (an organized lexicon), and allows queries to be performed on a database of labeled sentences. The current ontology comprises approximately one hundred categories of terms, such as "gene", "regulation", "human disease", and "brain area", and also contains the main Gene Ontology (GO) categories. Extraction of particular biological facts, such as gene-gene interactions, or the curation of GO cellular components, can be accelerated significantly by ontologies, with Textpresso automatically identifying sentences nearly as well as expert curators. We have established search engines for four literatures (C. elegans, Drosophila, Arabidopsis, and neuroscience), and other groups around the world have developed thirteen systems for other literatures. Currently, our four systems contain 112,000 papers with 40 million sentences; all systems worldwide contain 190,000 papers with approximately 65 million sentences.
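
The pipeline described in the abstract (sentence splitting, ontology-based labeling, and querying labeled sentences) can be illustrated with a minimal sketch. This is not Textpresso's implementation: the two-category `ONTOLOGY`, the naive regex sentence splitter, and the function names `split_sentences`, `label_sentence`, `build_index`, and `query` are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical toy ontology (category -> terms); not Textpresso's actual lexicon.
ONTOLOGY = {
    "gene": {"lin-12", "glp-1"},
    "regulation": {"regulates", "represses", "activates"},
}


def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def label_sentence(sentence):
    """Return the ontology categories whose terms occur in the sentence."""
    tokens = set(re.findall(r"[\w-]+", sentence.lower()))
    return {cat for cat, terms in ONTOLOGY.items() if tokens & terms}


def build_index(text):
    """Map each category to the list of sentences labeled with it."""
    index = defaultdict(list)
    for sentence in split_sentences(text):
        for cat in label_sentence(sentence):
            index[cat].append(sentence)
    return index


def query(index, *categories):
    """Return sentences labeled with every requested category."""
    if not categories:
        return []
    first, *rest = categories
    return [
        s for s in index.get(first, [])
        if all(s in index.get(c, []) for c in rest)
    ]
```

For example, building an index over a short passage and calling `query(index, "gene", "regulation")` returns only the sentences that mention both a gene term and a regulation term, which is the kind of category-combined search the abstract describes.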


Similar resources

Textpresso text mining:

Manual curation of experimental data from the biomedical literature is expensive and time-consuming; however, most biological knowledge bases still rely heavily on manual curation for data extraction and entry. We have developed and actively use a category-based information retrieval and extraction system for curating C. elegans proteins to the Gene Ontology's Cellular Component Ontology. The s...


Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature

We have developed Textpresso, a new text-mining system for scientific literature whose capabilities go far beyond those of a simple keyword search engine. Textpresso's two major elements are a collection of the full text of scientific articles split into individual sentences, and the implementation of categories of terms for which a database of articles and individual sentences can be searched....


Adaptation of an Open Source Semantic and Conceptual Retrieval Framework to the Astrobiological Domain

Introduction: Astrobiology is by nature a system-level science, meaning that it is concerned with complex, multidisciplinary, multi-phenomena behaviors of large physical and biological systems. Due to the breadth of the undertakings in astrobiological inquiry, researchers in the field must rely heavily on information technology to consolidate and represent knowledge and data from across many d...


AIRFrame: Astrobiology Integrative Research Framework

Introduction: Astrobiology is by nature a system-level science, meaning that it is concerned with complex, multidisciplinary, multiphenomena behaviors of large physical and biological systems. Due to the breadth of the undertakings in astrobiological inquiry, researchers in the field must rely heavily on information technology to consolidate and represent knowledge and data from across many d...


Improved Skips for Faster Postings List Intersection

Information retrieval can be achieved through computerized processes by generating a list of relevant responses to a query. The document processor, matching function, and query analyzer are the main components of an information retrieval system. Document retrieval systems are fundamentally based on Boolean, vector-space, probabilistic, and language models. In this paper, a new methodology for mat...




Publication date: 2008